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Abstract 

Drawing a sample from a discrete distribution is 
one of the building components for Monte Carlo 
methods. Like other sampling algorithms, dis¬ 
crete sampling suffers from the high computa¬ 
tional burden in large-scale inference problems. 

We study the problem of sampling a discrete 
random variable with a high degree of depen¬ 
dency that is typical in large-scale Bayesian in¬ 
ference and graphical models, and propose an ef¬ 
ficient approximate solution with a subsampling 
approach. We make a novel connection between 
the discrete sampling and Multi-Armed Bandits 
problems with a finite reward population and pro¬ 
vide three algorithms with theoretical guarantees. 
Empirical evaluations show the robustness and 
efficiency of the approximate algorithms in both 
synthetic and real-world large-scale problems. 

1. Introduction 

Sampling a random variable from a discrete (conditional) 
distribution is one of the core operations in Monte Carlo 
methods. It is an ubiquitous and often necessary compo¬ 
nent for inference algorithms such as Gibbs sampling and 
particle filtering. Applying discrete sampling for large- 
scale problems has been a challenging task like other 
Monte Carlo algorithms due to the high computational 
burden. Various approaches have been proposed to ad¬ 
dress different dimensions of “large scales”. For example, 
distributed algorithms have been used to sample a model 
with a large number of discrete variables (Newman et ah, 
2009; Bratires et ah, 2010; Wu et ah, 2011), smart transi¬ 
tion kernels were described for Markov chain Monte Carlo 
(MCMC) algorithms to sample efficiently a single variable 
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with a large or even infinite state space (Li et ah, 2014; 
Kalli et ah, 2011). This paper is focused on another dimen¬ 
sion of the “large-scales” where the variable to sample has 
a large degree of statistical dependency. 

Consider a random variable with a finite domain X £ X 
and a distribution in the following form 

N 

p(X = x) oc p(X = x), with p(X = x) = /o(x) f n (x), 

n— 1 

(i) 

where f n can be any function of x. Such distribution occurs 
frequently in machine learning problems. For example, in 
Bayesian inference for a model with parameter X and N 
i.i.d. observations T> = {y„}^ = i. the posterior distribu¬ 
tion of X depends on all the observations when sufficient 
statistics is not available. The unnormalized posterior dis¬ 
tribution can be written asp(X\V) = p(X) JliLi p(yi|X). 

In undirected graphical model inference problems where 
a node X, t appears in N potential functions, the distribu¬ 
tion of X, depends on the value of all of the N functions. 
The unnormalized conditional distribution is p(Xj|x_j) = 
n„=i 4>n(Xi, x_j), where x_, denotes the value of all the 
other nodes in the graph and f r , denotes a potential func¬ 
tion that includes A',; in the scope. In this paper we study 
how to sample a discrete random variable X in a manner 
that is scalable in N. 

A common approach to address the big data problem is 
divide-and-conquer that uses parallel or distributed com¬ 
puting resources to process data in parallel and then syn¬ 
chronize the results periodically or merely once in the end 
(Scott et ah, 2013; Medlar et ah, 2013; Xu et ah, 2014). 

An orthogonal approach has been studied for the 
Metropolis-Hastings (MH) algorithm in a general state 
space by running a sampler with subsampled data. This ap¬ 
proach can be combined easily with the distributed comput¬ 
ing idea for even better scalability (e.g. Ahn et ah, 2015). 

Maclaurin & Adams (2015) introduced an MH algorithm 





Scalable Discrete Sampling as a Multi-Armed Bandit Problem 


in an augmented state space that could achieve higher effi¬ 
ciency than the standard MH by processing only a subset of 
active data every iteration while still preserving the correct 
stationary distribution. But the introduction of auxiliary 
variables might also slow down the overall mixing rate in 
the augmented space. 

Approximate MH algorithms have been proposed in the 
subsampling approach with high scalability. The stochas¬ 
tic gradient Langevin dynamics (SGLD) (Welling & Teh, 
2011) and its extensions (Ahn et al., 2012; Chen et al., 
2014; Ding et al., 2014) introduced efficient proposal dis¬ 
tributions based on subsampled data. Approximate al¬ 
gorithms induce bias in the stationary distribution of the 
Markov chain. But given a fixed amount of runtime they 
could reduce the expected error in the Monte Carlo esti¬ 
mate via a proper trade-off between variance and bias by 
mixing faster w.r.t. the runtime. This is particularly impor¬ 
tant for large-scale learning problems when the runtime is 
one of the limiting factors for generalization performance 
(Bottou & Bousquet, 2008). However, the stochastic gra¬ 
dient MCMC approach usually skips the rejection step in 
order to obtain sublinear time complexity and the induced 
bias is very hard to estimate or control. 

Another line of research on approximate subsampled MH 
algorithms does not ignore the rejection step but controls 
the error with an approximate rejection step based on a sub¬ 
set of data (Korattikara et al., 2014; Bardenet et al., 2014). 
The bias can thus be better controlled (Mitrophanov, 2005; 
Pillai & Smith, 2014). That idea has also been extended to 
slice sampling (DuBois et al., 2014) and Gibbs for binary 
variables (Korattikara et al., 2014). 

In this paper we follow the last line of research and pro¬ 
pose a novel approximate sampling algorithm to improve 
the scalability of sampling discrete distributions. We first 
reformulate the problem in Eq. 1 as a Multi-Armed Ban¬ 
dit (MAB) problem with a finite reward population via the 
Gumbel-Max trick (Papandreou & Yuille, 2011), and then 
propose three algorithms with theoretical guarantees on the 
approximation error and an upper bound of N\X\ on the 
sample size. This is to our knowledge the first attempt to 
address discrete sampling problem with a large number of 
dependencies and our work will likely contribute to a more 
complete library of scalable MCMC algorithms. More¬ 
over, the racing algorithm in Sec. 3.3 provides a unified 
framework for subsampling-based discrete sampling, MH 
(Korattikara et al., 2014; Bardenet et al., 2014) and slice 
sampling (DuBois et al., 2014) algorithms as discussed in 
Sec. 4. We also show in the experiments that our algorithm 
can be combined straightforwardly with stochastic gradient 
MCMC to achieve both high efficiency and controlled bias. 
Lastly, the proposed algorithms also deserve their own in¬ 
terest for MAB problems under this particular setting. 


We first review an alternative way of drawing discrete vari¬ 
ables and build a connection with MABs in Sec. 2, then 
propose three algorithms in Sec. 3. We discuss related 
work in Sec. 4 and evaluate the proposed algorithms on 
both synthetic data and real-world problems of Bayesian 
inference and graphical model inference in Sec. 5. Particu¬ 
larly, we show how our proposed sampler can be combined 
conveniently as a building component with other subsam¬ 
pling sampler for a hierarchical Bayesian model. Sec. 6 
concludes the paper with a discussion. 

2. Approximate Discrete Sampling 

2.1. Discrete Sampling as an Optimization Problem 

The common procedure to sample X from a discrete do¬ 
main X = {1,2,..., D} is to first normalize p{X) and 
compute the CDF F(X = x ) = p(X = i). Then 
draw a uniform random variable u ~ Uniform(0,1], and 
find x that satisfies F(x — 1) < u < F(x). This proce¬ 
dure requires computing the sum of all the unnormalized 
probabilities. For f in the form of Eq. 1 this is 0(NI)). 

An alternative procedure is to first draw D i.i.d. sam¬ 
ples from the standard Gumbel distribution 1 e, ~ 
Gumbel(0,1), and then solve the following optimization 
problem: 

x = argmaxlogp(i) + £». (2) 

iex 

It is shown in Kuzmin & Warmuth (2005) that x follows the 
distribution p(X). With this method after drawing random 
variables that do not depend on p, we turn a random sam¬ 
pling problem to an optimization problem. While the com¬ 
putational complexity is the same to draw an exact sample, 
an approximate algorithm may potentially save computa¬ 
tions by avoiding computing accurate values of p(X = x) 
when x is considered unlikely to be the maximum as dis¬ 
cussed next. 

2.2. Approximate Discrete Sampling as a Multi-Armed 
Bandits Problem 

In a Multi-Armed Bandit (MAB) problem, the i’th bandit 
is a slot machine with an arm, which when pulled gener¬ 
ates an i.i.d. reward l t from a distribution associated with 
that arm with an unknown mean /j,. The optimal arm iden¬ 
tification problem for MABs (Bechhofer, 1958; Paulson, 
1964) in the fixed confidence setting is to find the arm with 
the highest mean reward with a confidence 1 —5 using as 
few pulls as possible. 

Under the assumption of Eq. 1, the solution in Eq. 2 can be 

*The Gumbel distribution is used to model the maximum ex¬ 
treme value distribution. If a random variable Z ~ Exp(l), 
then — log(Z) ~ Gumbel(0,1). e can be easily drawn as 
— log(— log(it)) with u ~ U[0,1], 
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expressed as 


3.1. Notations 


N 


x = argmax V log f n (i) + log f 0 (i ) + £i 

N ( 1 \ 
= argmax^ flog/„(i) + — (log/ 0 (*)+£*)j 

71=1 S -- ' 


= li 


1 X A 

= argmax — V l i<n = argmaxE z .^ Uniform(£i) [k\ 
iex ^ —, iex 

n— 1 
def 

= argmax ^ (3) 

iex 

def 

where £, = {Z^i, ■ ■ ■, k,N}- After drawing D Gumbel 

variables £j, we turn the discrete sampling problem into 
the optimal arm identification problem in MABs where the 
reward li is uniformly sampled from a finite population 
An approximate algorithm that solves the problem with a 
fixed confidence may avoid drawing all the rewards from 
an obviously sub-optimal arm and save computations. We 
show the induced bias in the sample distribution as follows 
with the proof in Appx. A. 1 . 

Proposition 1. If an algorithm solves (2) exactly with a 
probability at least 1 — 5 for any value of e, the total vari¬ 
ation between the sample distribution p and the true distri¬ 
bution is bounded by 

||pP0-p(X)||tv<<5 (4) 

When applied in the MCMC framework as a transition ker¬ 
nel, we can apply immediately the theories in Mitrophanov 
(2005); Pillai & Smith (2014) to show that the approximate 
Markov chain satisfies uniform ergodicity under regular 
conditions and the analysis of convergence rate are readily 
available under various assumptions. So the discrete sam¬ 
pling problem of this paper reduces to finding a good MAB 
algorithm for Eq. 2 in our problem setting. 

3. Algorithms for MABs with a Finite 
Population and Fixed Confidence 

The key difference of our problem from the regular MABs 
is that our rewards are generated from a finite population 
while regular MABs assume i.i.d. rewards. Because one 
can obtain the exact mean by sampling all the TV values 
l i n for arm i without replacement , a good algorithm should 
pull no more than TV times for each arm regardless of the 
mean gap between arms. We introduce three algorithms in 
this section whose sample complexity is upper bounded by 
O(ND) in the worst case and can be very efficient when 
the mean gap is large. 


The iteration of an algorithm is indexed by t. We denote 
the entire index set with [TV] = {1,2,..., TV}, the sampled 
set of reward indices up to f’th iteration from arm i with 
A; C [TV], and the corresponding number of sampled 
rewards with Tp 1 . We define the estimated mean for /’th 
arm with pf } = Y. nej \f('-> h,n, the natural variance 

(biased) estimate with (&P ) 2 = T, neJ ^M (k,n ~ 

pP) 2 , the variance estimate of the mean gap between two 
arm with (<rg) 2 = E neJV f> ((kn ~ h,n ) - (#1° - 

pp ))' 2 (defined only when Afp 1 = A /^), the bound of the 

def 

reward value C, = max„ jn /{Zj „ — hy}- The subscripts 
and superscripts may be dropped for notational simplicity 
when the meaning is clear from the context. 

3.2. Adapted lil’UCB 

We first study one of the state-of-the-art algorithms for 
fixed-confidence optimal arm identification problem and 
adjust it for the finite population setting. The liFUCB al¬ 
gorithm (Jamieson et ak, 2014) maintains an upper confi¬ 
dence bound (UCB) of p % that is inspired by the law of the 
iterated logarithm (LIL) for every arm. At each iteration, it 
draws a single sample from the arm with the highest bound 
and updates it. The algorithm terminates when some arm 
is sampled much more often than all the other arms. We 
refer readers to Fig. 1 of Jamieson et al. (2014) for details. 
The time complexity for t iterations is 0(\og(D)t). It was 
shown in Jamieson et al. (2014) that liTUCB achieved the 
optimal sample complexity up to constants. 

However, liTUCB requires i.i.d. rewards for each arm i, 
that is, sampled with replacement from Ci. Therefore, the 
total number of samples t is unbounded and could be A> 
ND when the means are close to each other. We adapt 
liTUCB for our problem with the following modifications; 

1. Samples without replacement for each arm but 
keep different arms independent. 

2. When Tp 1 = TV for some arm i, the estimate pp 
becomes exact. So set its UCB to pp. 

3. The algorithm terminates either with the original stop¬ 
ping criterion or when the arm with the highest upper 
bound has an exact mean estimate, whichever comes 
first. 

The adapted algorithm satisfies all the theoretical guaran¬ 
tees in Thm. 2 of Jamieson et al. (2014) with additional 
properties as shown in the following proposition with proof 
in Appx. A. 2. 
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Proposition 2. Theorem 2 of Jamieson et al. (2014) holds 
for the adapted lil’UCB algorithm. Moreover < 

N,\/i,t. Therefore, when the algorithm terminates, t = 
Y,iz X T?><ND. 

Notice that Thm. 2 of Jamieson et al. (2014) shows that t 
scales roughly as 0(1/A 2 ) with A being the mean gap and 
therefore t <C ND when the gap is large. 

3.3. Racing Algorithm for a Finite Population 

When rewards are sampled without replacement, the nega¬ 
tive correlation between rewards would generally improve 
the convergence of fii. Unfortunately, the bound in lil’UCB 
ignores the negative correlation when < N even with 
the adaptations. We introduce a new family of racing al¬ 
gorithms (Maron & Moore, 1994) that takes advantage of 
the finite population setting as shown in Alg. 1. The choice 
of the uncertainty bound function G differentiates specific 
algorithms and two examples will be discussed in the fol¬ 
lowing sections. 

Alg. 1 maintains a set of candidate set V initialized with all 
arms. At iteration t, a shared mini-batch of rn/ 11 indices are 
drawn w/o replacement for all survived arms in V. Then 
the uncertainty bound G is used to eliminate sub-optimal 
arms with a given confidence. The algorithm stops when 
only one arm remains. We require for rn ( ' t] that the total 
number of sampled indices T = Y^t=i equals N at 
the last iteration t*. Particularly, we take a doubling sched¬ 
ule TW = 2T^ _1 ) (so t* = flog 2 ^lj] + 1) an d leave 
to*- 1 ) as a free parameter. We also require G(-,T, •) =0 

whenever T = N so that Alg. 1 always stops within t* it¬ 
erations. The computational complexity for t iterations is 
O(DT^) with the marginal estimate <f, ; and 0(D 2 T W) 
with the pairwise estimate aij. The former version is more 
efficient than the latter when D is large at the price of a 
looser bound. 

Proposition 3. If G satisfies 

£ = P(3t < t*fi i (t) - n > G(S,T^ ,d {t \C)) < 5, (5) 

for any S £ (0,1) with a probability at least 1 — <5, Alg. 1 
returns the optimal arm with at most ND samples. 

The proof is provided in Appx. A. 3. Unlike adapted 
lil’UCB, Racing draws a shared set of sample indices 
among all the arms and could provide a tighter bound with 
pairwise variance estimates d t j when there is positive cor¬ 
relation, a typical case in Bayesian inference problems. 

3.3.1. Racing with Serfling Concentration 
BOUNDS FOR G 

Serfling (1974) studied the concentration inequalities of 
sampling without replacement and obtained an improved 


Algorithm 1 Racing Algorithm with a Finite Reward Pop¬ 
ulation_ 

input Number of arms D, population size N, mini-batch 
sizes confidence level 1 — 5, uncertainty 

bound function G(S, T, d. C ), range of samples C, (op¬ 
tional). 

t 4- 0 , T 4- 0 , D <- { 1 , 2 ,..., D}, Jsf <- 0 

while \D\ > 1 do 

t 4 — / T 1 

Sample w/o replacement indices M C [_/V]\A/\ 
and set N 4— /V U A4 , T 4— T + m W 
Compute £ T>,n £ A4, and update fii and ay 

(or dij), Vi £ V. 

Find the best arm x 4— argma x iGV fii 
Eliminate sub-optimal arms when the estimated gap is 
large V 4— T>\{i : fi x - fii > G{^,T,d x ,C x ) + 
G(jy,T, di, Gf)} (or V 4— V\{i : fi x - fii > 
G( , T, dx^i), G x + Ci}) 
end while 
output V 


Hoeffding bound. Bardenet & Maillard (2013) extended 
the work and provided an empirical Bernstein-Serfling 
bound that was later used for the subsampling-based MH 
algorithm (Bardenet et al., 2014): for any S £ (0,1] and 
any n < N, with probability 1 — S, it holds that 


2 p n log(5/5) kC log(5/5) 

n n 

= B EBS (5,n,d n ,C) (6) 

where k = | + -jp=, and p n = 

1 — 7r n _i if n < N/2 . def n 

(l-7T„)(l + i) if n > N/2 ’ W ™ 7F " - N' 

The extra term p n that is missing in regular empirical 
Bernstein bounds reduces the bound significantly when 
n is close to N. We set m ^ = 2 in Alg. 1 to provide a 
valid for any t and set the uncertain bound G with the 
empirical Bernstein-Serfling (EBS) bounds as 

G EBS (5,T, a, C ) = B ebs T, d, C^j (7) 

It is trivial to prove that Gebs satisfies the condition in Eq. 5 
using a union bound over t <t*. 

3.3.2. Racing with a Normal Assumption for G 

The concentration bounds often give a conservative strat¬ 
egy as they assume an arbitrary bounded reward distribu¬ 
tion. When the number of drawn samples is large, the cen¬ 
tral limit theorem suggests that fif4) follows approximately 
a Gaussian distribution. Korattikara et al. (2014) made such 
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an assumption and obtained a tighter bound. We first pro¬ 
vide an immediate corollary of Prop. 2 in Appx. A of Ko- 
rattikara et al. (2014). 

Corollary 4. Let p^l it , t = 1,2,... ,t* be the estimated 
means using sampling without replacement from any finite 
population with mean p and unit variance. The joint nor¬ 
mal random variables pM') that match the mean and co- 
variance matrix with follow a Gaussian random walk 
process as 

p t ,(p {t) \p {1) ,...,p {t ~ 1 '>) = S t ) ( 8 ) 

where m t = p + A t (p t _ 1 - p), S t = ^fy (l - ) . 

A * = lTd-nZir Bt = with Wt sh ° rt f° r *TPh 

Remark 5. The marginal distribution p(p^) 

A ^/i, ^1 — t ' n ^_~ 1 ^ where the variance approaches 

0 when —> TV. 

Assumption 6. When l,Vi, we assume n (t) « a 

and the central limit theorem suggests that the joint distri¬ 
bution of p^ /<jV) can be approximated by the joint distri¬ 
bution of p^\ 


With the normal assumption, we choose the uncertainty 
bound G in the following form 


l-fNormal(^: (7) 


VT 


i - 


T- 1 
TV- 1 


1/2 

-^Normal (9) 


Intuitively we use a constant confidence level, $(f?N orma i), 
for all marginal distributions of p W over t where $(•) is 
the CDF of the standard normal. To choose the constant 
^Normal, we plug ^Normal into the condition for G in Eq. 5 
and apply the normal distribution (8) to solve the univari¬ 
ate equation £(B) = 5. This way of computing G gives a 
tighter bound than applying the union bound across t as in 
the previous section because it takes into account the cor¬ 
relation of mean estimates across iterations. Appx. B pro¬ 
vides a lookup table and a plot of f?Normai(<5) = £ _1 (A). 
Notice that /j Nonn;li only needs to be computed once and 
we can obtain it for any 6 by either interpolating the table 
or computing numerically with code to be shared (runtime 
< 1 second). For the parameter of the first mini-batch size 
ml 1 ', a value of 50 performs robustly in all experiments. 


We provide the sample complexity below with the proof 
in Appx. A.4. Particularly, T*( A) -A DN as A -> 
0, and T*(A) = Dm^ when A > 2 B Normal (5/D') 
s/(N/mA) -1)/(TV- 1). 

Proposition 7. Let x* be the best arm and A be the mini¬ 
mal normalized gap of means from other arms, defined as 
min i^ x * when using marginal variance estimate <7, 

and minjjj;* when using paim’ise variance esti- 

mate d x ,i- If As sump. 6 holds, with a probability at least 


1 — S Racing-Normal draws no more rewards than 


T*( A) = D 


N 


(TV — 1) 


A 2 


4 Bl 


,i(<5 /D’) 


+ 1 


i(i) 


( 10 ) 


where \n} m = m 2 ^°S 2 "/™1 A TV > n, Vn < TV. D' = D 
if using di and is D — 1 if using d x ,i- 


3.4. Variance Reduction for Random Rewards with 
Control Variates 

The difficulty of MABs depends heavily on the ratio of the 
mean gap to the reward noise, A. To improve the sig¬ 
nal noise ratio, we exploit the control variates technique 
(Wilson, 1984) to reduce the reward variance. Consider a 
variable /ij >n whose expectation E„^,[a r][hi, n \ can be com¬ 
puted efficiently. The residue reward l i n — h i n +E n [/ij n ] 
has the same mean as l, n and the variance is reduced if 
hi, n ~ h,n- In the Bayesian inference experiment where 
the factor f n (X = i) = p(y n \X = i ), we adopt a similar 
approach as Wang et al. (2013) and take the Taylor expan¬ 
sion of li tn around a reference point y as 

1 def 

U,u ~ k(y)+gj(.y n -y)+-(y n -y) T Hi(y n -y) = hp n 

(ID 

where g, and Hi are the gradient and Hessian matrix of 
logp(y|i) respectively evaluated at y. E[/i i n ] can com¬ 
puted analytically with the first two moments of y n . A 
typical choice of y is E[y]. 

The control variate method is mostly useful for Racing- 
Normal. For algorithms depending on a reward bound C in 
order to get a tight bound for l i n — h l n it requires a more 
restrictive condition for C as in Bardenet et al. (2015) and 
we might end up with an even more conservative strategy 
in general cases. 


4. Related Work 

The Gumbel-Max trick has been exploited in Kuzmin & 
Warmuth (2005); Papandreou & Yuille (2011); Maddison 
et al. (2014) for different problems. The closest work is 
Maddison et al. (2014) where this trick is extended to draw 
continuous random variables with a Gumbel process, rem¬ 
iniscent to adaptive rejection sampling. 

Our work is closely related to the optimal arm identification 
problem for MABs with a fixed confidence. This is, to our 
knowledge, the first work to consider MABs with a finite 
population. The proposed algorithms tailored under this 
setting could be of interest beyond the discrete sampling 
problem. The normal assumption in Sec. 3.3.2 is similar to 
UCB-Normal in Auer et al. (2002) but the latter assumes a 
normal distribution for individual rewards and will perform 
poorly when it does not hold. 
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The bounds in Sec. 3.3 are based on subsampling-based 
MH algorithms in Bardenet et al. (2014); Korattikara et al. 
(2014). The proposed algorithm extends those ideas from 
MH to discrete sampling. In fact, let x and x' be the current 
and proposed value in an MH iteration, Racing-EBS and 
Racing-Normal reduce to the algorithms in Bardenet et al. 
(2014) and Korattikara et al. (2014) respectively if we set 

X = {x,x'}, /o(l) = up{x)q{x'\x), 

/o(2) = p{x')q(x\x'), f n (x)=p( y n \x) (12) 

where p(x) is the prior distribution, u ~ Uniform[0,1] 
and q(- 1-) is the proposal distribution. The difference with 
Bardenet et al. (2014) is that we distribute the error S 
evenly across t in Eq. 7 while Bardenet et al. (2014) set 
St = (p — 1 )/(p(T^) p )S with p a free parameter. The 
differences with Korattikara et al. (2014) are that we take a 
doubling schedule for m^ and replace the t-test with the 
normal assumption. We find that our algorithms are more 
efficient and robust than both original algorithms in prac¬ 
tice. Moreover, the binary Gibbs sampling in Appx. F of 
Korattikara et al. (2014) is also a special case of Racing- 
Normal with D = 2. Therefore, Alg. 1 provides a unifying 
approach to a family of subsampling-based samplers. 

The variance reduction technique is similar to the proxies 
in Bardenet et al. (2015), but the control variate here is a 
function in the data space while the proxy in the latter is 
a function in the parameter space. We do not assume the 
posterior distribution is approximate Gaussian and our al¬ 
gorithm works with multi-modal distributions. 

It is important not the confuse the focus of our algorithm for 
the big N problem in Eq. 1 with other algorithms that ad¬ 
dress sampling for a large state space (big D ) or similarly 
a high-dimensional vector of discrete variables (exponen¬ 
tially large D). The combination of these two approaches 
for problems with both big N and big D is possible but 
beyond the scope of this paper. 

5. Experiments 

Since this is the first work to discuss efficient discrete sam¬ 
pling for problem (1), we compare the adapted lil’UCB, 
Racing-EBS, Racing-Normal with the exact sampler only. 
We report the result of Racing-Normal in real data experi¬ 
ments only as the speed gains of the other two are marginal. 

5.1. Synthetic Data 

We construct a distribution with D = 10 by sampling 
N = 10 5 rewards of for each state from one of the three 
distributions A/”(0,1), Uniform[0,1], LogNormal(0,2). 
We normalized l i n to have a fixed distribution p(X) in 
Fig. 1(a) and a reward variance er 2 that controls the dif¬ 
ficulty. The normal distribution is the ideal setting for 


Racing-Normal, and the uniform distribution is desirable 
for adapted lil’UCB and Racing-EBS as the reward bound 
is close to a. The LogNormal distribution, whose ex. kur- 
tosis ss 4000, is difficult for all due to the heavy tail. We 
use a tight bound C = max{lj in — for Racing-EBS. 
We set the scale parameter of adapted lil’UCB with C/2 
and other parameters with the heuristic setting in Jamieson 
et al. (2014). Racing uses the pairwise variance estimate. 

Fig. l(b)-(d) show the empirical error of best arm iden¬ 
tification by drawing 10 4 samples of X for each setting 
and vary the target error bound 6 £ [10 —3 ,0.1]. The 
bound appears very loose for lil’UCB and Racing-EBS but 
is sharp for Racing-Normal when the noise is large (1(b)) 
and 5 <C 1. This is consistent with the direct comparison 
of uncertainty bounds in Fig. 1(e). Consequently, given the 
same error tolerance 5 Racing-Normal requires much fewer 
rewards than the other conservative strategies in all the set¬ 
tings except when a = 10 -5 and l l:U ~ Uniform[0,1], 
as shown in Fig. l(f)-(h). We verify the observations with 
more experiments in Appx. C.l with D £ {2,100} and 
marginal estimate by. 

Surprisingly, Racing-Normal performs robustly regard¬ 
less of reward distributions with the first mini-batch size 
to*- 1 ) = 50 while it was shown in Bardenet et al. (2014) 
that the algorithm with the same normal assumption in Ko¬ 
rattikara et al. (2014) failed with LogNormal even when 
mX 1 = 500. The dramatic improvement in robustness is 
mainly due to our doubling scheme where central limit the¬ 
orem applies quickly with increasing exponentially. 
We do not claim that the single trick will solve the prob¬ 
lem completely because there still exist cases in theory with 
extremely heavy-tailed reward distributions where our nor¬ 
mal assumption does not hold and the algorithm will fail to 
meet the confidence level. In practice, we do not observe 
that pathological case in any of the experiments. 

5.2. Bayesian ARCH Model Selection 

We evaluate Racing-Normal in a Bayesian model selection 
problem for the auto-regressive conditional heteroskedas- 
ticity (ARCH) models. The discrete sampler is integrated 
in the Markov chain as a building component to sample the 
hierarchical model. Specifically, we consider a mixture of 
ARCHs for the return r t of stock price series with student-t 
innovations, each component with a different order q: 

9 

iid . {c\ -i \ 2 i \ r 2 

n = cr t z t , zt ~ t„( 0,1), a t = a 0 + ^ 

i= 1 

q ~ Discrete (7r), ay, v Gamma(l, 1) 

where 7r = {n q : q £ Q} is the prior distribution of a can¬ 
didate model in the set Q. The random variables to infer 
include the discrete model choice q and continuous param- 
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Figure 1. Synthetic data. ((b),(c),(d)) Estimated error with 95% confidence interval. Plots not shown if no error occured. ((f),(g),(h)) 
proportion of sampled rewards. is sampled from Normal (x), Uniform (O) and LogNormal (□) distributions. Plots of Racing- 
Normal overlap in ((f),(g).(h)). 


eters {cti}f =0 ,^. We adopt the augmented MCMC algo¬ 
rithm in Carlin & Chib (1995) to avoid transdimensional 
moves. We apply subsampling-based scalable algorithms 
to sample all variables with subsampled observations {r t }: 
Racing-Normal Gibbs for q, stochastic gradient Langevin 
dynamics (SGLD) (Welling & Teh, 2011) corrected with 
Racing-Normal MH (Sec. 4) for a, and v. We use adjusted 
priors n q as suggested by Carlin & Chib (1995) for suffi¬ 
cient mixing between all models and tune them with adap¬ 
tive MCMC. The adjusted posterior p(q |r) oc Tr q p(r\q) is 
then close to uniform and the value 7r g /7r g provides an es¬ 
timate to the real unnormalized posterior p{q |r). Control 
variates are also applied to reduce variance. Details of the 
sampling algorithm are provided in Appx. C.2. 

We apply the model on the 5-minute Shanghai stock ex¬ 
change composite index of one year consisting of about 
13,000 data points (Fig. 2(a)). Q = {5,10,15, 20, 25, 30}. 
We set mrS 1 ’ = 50 and S = 0.05. The control variate 
method reduces the reward variance by 2~3 orders of mag¬ 
nitude. Fig. 2(b) shows the estimated log-posterior of q by 
normalizing Tr q /n q in the adaptive MCMC as a function 
of the number of likelihood evaluations (proportional to 
runtime). The subsampling-based sampler (Sub) converges 
about three times faster. We then fix 7 t 9 for a fixed station¬ 
ary distribution and run MCMC for 10 5 iterations to com¬ 
pare Sub with the exact sampler. The empirical error rates 
for Racing-Normal Gibbs and MH are about 4 x 10 4 and 
2 x 10 -3 respectively. Fig. 2(c) shows estimated adjusted 
posterior with 5 runs, and (d) compares the auto-correlation 
of sample q. Sub obtains over twice the effective sample 


size without noticeable bias after the burn-in period. 

5.3. Author Coreference 

We then study the performance in a large-scale graphical 
model inference problem. The author coreference prob¬ 
lem for a database of scientific paper citations is to clus¬ 
ter the mentions of authors into real persons. Singh et al. 
(2012) addressed this problem with a conditional random 
field model with pairwise factors. The joint and conditional 
distributions are respectively 

Pe( y|x) oc exp fo{Xi,Xj) I , 

\ 2 H=Vj,i¥^3^hO ) 

Pe{Yi = t/i|y_i,x) oc exp ^ f g (xi,Xj) 

\yj£C y ={j:y j =y,j=£i} 

where x = {a^}^ is the set of observed author mentions 
and yi £ N + is the cluster index for <’th mention. The fac¬ 
tor fg(xi,Xj) measures the similarity between two men¬ 
tions based on author names, coauthors, paper title, etc, pa¬ 
rameterized by 6. In the conditional distribution, y, can 
take a value of any non-empty cluster or another empty 
cluster index. When a cluster C y contains a lot of men¬ 
tions, a typical case for common author names, the number 
of factors to be evaluated N y = \C y \ will be large. We 
consider the MAP inference problem with fixed 6 using 
annealed Gibbs sampling (Finkel et al., 2005). We apply 
Racing-Normal to sample Y t by subsampling C y for each 
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(a) Stock index return rt 


(b) Estimated log p(qjr) 


0.2 



q 

(c) Adjusted post. p(q j r) 



(d) Auto-correlation of q 


Figure 2. Bayesian ARCH Model Selection. Solid: exact, dashed: approximate using 
Sub Gibbs + SGLD with Sub MH. 



(a) B' 1 ' F-l vs #factor evaluations 



(b) B 3 F-l vs iteration 


Figure 3. Author Coreference. Bigger B 3 
F-l score is better. 


candidate value y. An important difference of this problem 
from Eq. 1 is that N y ^ N y > ,\/y ^ y' and N y has a heavy 
tail distribution. We let the mini-batch size depend on N y 
with details provided in Appx. C.3. 

We run the experiment on the union of an unlabeled DBLP 
dataset of BibTex entries with about 5M authors and a Rexa 
corpus of about 1 IK author mentions with 3160 entries la¬ 
beled. We monitor the clustering performance on the la¬ 
beled subset with the B 3 F-l score (Bagga & Baldwin, 
1998). We use 5 = 0.05 and the empirical error rate is 
about 0.046. The number of candidate values D varies in 
2 ~ 215 and N y varies in 1 ~ 1829 upon convergence. 
Fig. 3(a) shows the F-l score as a function of the num¬ 
ber of factor evaluations with 7 random runs for each algo¬ 
rithm. Sub Gibbs converges about three times faster than 
exact Gibbs. Fig. 3(b) shows F-l as a function of iterations 
that renders almost identical behavior for both algorithms, 
which suggests negligible bias in Sub Gibbs. The relative 
number of the evaluated factors of sub to exact Gibbs indi¬ 
cates about a 5-time speed up near convergence. The initial 
speed up is small because every cluster is initialized with a 
single mention, i.e. N y = 1. 

6. Discussion 

We consider the discrete sampling problem with a high de¬ 
gree of dependency and proposed three approximate al¬ 


gorithms under the framework of MABs with theoretical 
guarantees. The Racing algorithm provides a unifying ap¬ 
proaches to various subsampling-based Monte Carlo algo¬ 
rithms and also improves the robustness of the original MH 
algorithm in Korattikara et al. (2014). This is also the first 
work to discuss MABs under the setting of a finite reward 
population. 

Empirical evaluations show that Racing-Normal achieves 
a robust and the highest speed-up among all competitors. 
Whilst adaptive lil’UCB shows inferior empirical perfor¬ 
mance to Racing-Normal, it has a better sample complex¬ 
ity w.r.t. the number of arms D. It will be a future direc¬ 
tion to combine the bound of Racing-Normal with other 
MAB algorithms including lil’UCB for a better scalabil¬ 
ity in D. Another important problem is on how to relax 
the assumptions for Racing-Normal without sacrificing the 
performance. 

It would also be an interesting direction to extend our work 
to draw continuous random variables efficiently with the 
Gumbel process (Maddison et al., 2014). In continuous 
state space, there are infinitely many “arms” and a naive 
application of our algorithm will lead to infinitely large er¬ 
ror bound. This problem can be alleviated with algorithms 
for contextual MAB problems. 
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A. Proofs 

A.l. Proof of Prop. 1 

Proof. For a discrete state space, the total variation is 
equivalent to half of L\ distance between two probability 
vectors. Denote by p(X = %\e) the distribution of the out¬ 
put of the approximate algorithm conditioned on the vec¬ 
tor of Gumbel variables e, and x(e) the solution of Eq. 2 
as a function of e. According to the premise of Prop. 1, 
p(X = a;(e)|£) > 1 — <5, Ve. We can bound the L\ error of 
the conditional probability as 

Y I P( X = *l £ ) - <*i,s(e)| 

iex 

= \p(X = x(£)\e) - 1| + Y \p( x = i\s)\ < 2<5, Ve 


where Si j is the Kronecker delta function. Then we can 
show 


M x ) -pPOHtv 

= \ Y I p( x = = *)l 

iex 


_ 1 
" 2 

1 

< 

“ 2 



(p(X = i\e) - 5^ x(e) ) dP(e) 


/ |p( A ' = *l £ )-^,r C (e)|dP(£) 

iex 


1 

2 


Y I p( x = *l £ ) - K 


x(e) | 


d P(e) 


\iex 


< S 


(14) 


□ 


A.2. Sketch of the proof of Prop. 2 

Proof. As the proof of this proposition is almost identical 
to the proof of Jamieson et al. (2014), we only outlines the 
difference due to the adaptation. In the proof of Thm. 2 
in Jamieson et al. (2014), the i.i.d. assumption for rewards 
from each arm was used only in Lemma 3 to provide Cher- 
noff’s bound and Hoeffding’s bound. As noted in Sec. 6 
of Hoeffding (1963) those bounds would still hold when 
rewards are sampled from a finite population without re¬ 
placement. Therefore, when r P l ' t) < N all the bounds hold 
for adapted lil’UCB. 

When if) = N, the second modification sets the upper 
bound of the mean estimate to /)(*). That is a valid upper 
bound of /i,, in fact much tighter than the bound in the orig¬ 
inal algorithm because Y' 1 = Pi exactly when the entire 
population is observed. 


Therefore, as long as < N. Vi, Theorem 2 in Jamieson 
et al. (2014) applies to adapted liFUCB with modification 
1 and 2 only. 

With the third modification, could never be bigger 
than N at the stopping time, which proves the second part 
of Prop 2. The proof can then be concluded if we can 
show modification 3 does not change the output of adapted 
liFUCB with the first two modifications only. This is true 
because if we do not stop when the selected arm i satisfies 
7'5 = N, we do not need to update the upper bound of 
i because the estimated mean is already exact. Since no 
upper bound is changed, the arm i will always be chosen 
for now on and eventually the original stopping criterion of 
T- t ' > > 1 + A Yljzti Tj(t) is met an d the same arm i will be 
returned. □ 

A.3. Proof of Prop. 3 

Proof. Denote by x <t! the arm with the highest estimated 
mean at iteration t and x* the optimal arm with the highest 
true mean, p x - > Pi,\/i ^ x*. If Alg. 1 does not stop 
in the first t* — 1 iterations, the estimated means of all the 
survived arms become exact at the last iteration t*, pf = 

Pi because we require T ( 4 ) = N. Then = x*. As 
we require G(5,T = N,a,C) = 0 ,V5,a,C, all the sub- 
optimal arms will be eliminated by the last iteration and the 
algorithm always returns the correct best arm. This proves 
the upper bound of the sample size of ND. 

Now to prove the confidence level, all we need to show is 
that with at least a probability 1 — <5 arm x* survived all the 
iterations t < t*. 

Let us first consider the case when Alg. 1 uses the marginal 
variance estimate & Let the events 

Ei = jit < Pi > G J ,V* / x* 

E x * = 1 3 1 < - {-p x .) > G (^, TW, j 

(15) 

Applying condition Eq. 5 and the union bound, we get 
P(Ui e xEi) < Y^iex Ei = 8. So with a probability at 
least 1 — 5, none of those events will happen. In that case 
for any iteration t < t*, 

px px* z (Px px) (.Px* px*) + (.Px px*) 

< G , T«, Y } , Cx) + G (^, TW, ol 1 }, Gv) 

(16) 


So arm x* won’t be eliminated at iteration /. 

Similarly, for the case when Alg. 1 uses the pairwise vari- 
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ance estimate crx, let the events 

Ei,x = j^f < t*, (Ai 4) - Ax*) - {Pi - Mx*) 

>G^^,tW frf.Ci + a,.) },Vz^x* 

(17) 

Applying condition Eq. 5 and the union bound, we get 
X\{x*}Ei,x) 5: = ^0 with a 

probability at least 1 — <5 for any iteration t < t*, 

f^x fax* = (Arc Arc*) (/Ae f^x *) H - (ftx l^x *) 

<g(-^ i ,tW,^ ) x „G 3; +C'A (18) 

Therefore arm i: ! won’t be eliminated at iteration t. □ 

A.4. Proof of Prop. 7 

Proof. Denote by the arm with the highest estimated 
mean at iteration t. First consider the case when Alg. 1 uses 
the marginal variance estimate With the condition in 
Eq. 5, it follows that P(Ui £ xEi ) < E «= x P(Ei) < 5 
where Ei is defined in Eq. 15. So with a probability at least 
1-5, 

- A f } >/i x * - m - G (^,TV,a { $,C x }j 

(19) 

Alg. 1 will stop by iteration t. if the RHS of the equation 
above satisfies the stopping criterion for all i f x*, that is, 

Ms*-/A >2 (g(^,T^,&^,C x ^ 

+ G^,TU,a ( ? ) ,C^,\/i^x* 

( 20 ) 


Since we use a doubling schedule = 2T^ -1 ! with 
7 1 ! 1 ) = rn^ and 1 = N, Alg. 1 stops at an iteration no 
later than 

i=ri Og2 (f/m< 0 >)l+l (23) 

And the total number of samples drawn by t is upper 
bounded by D(m(°^2 t_1 A TV) = T*( A). 

Now consider the case when Alg. 1 uses the pairwise 
variance estimate dp\. With the condition in Eq. 5, it 
follows with the union bound that P{Ui£x\{x*}Ei) < 
Ejgat\{x*} E{Ei) < 5 where E\ is defined in Eq. 17. So 
with a probability at least 1 — <5, 

„(t) „(t) 

Mx* - Pi 

> fix.-m-G + C^j ,Vt ^ ^ 

(24) 

Now we can follow a similar argument as in the case with 
marginal variance estimate and prove the proposition. □ 

B. Table and Figure of B-^ orraSL i(S, n TW ) 

Table 1 shows i?Normai(<), TTrt 1 )) with ^ varying in 
[10 —6 ,0.49]. and the proportion of the first mini-batch 
7 t t(1 , = ml 1 ! /N £{5 x 1CT 5 ,10" 4 , 5 x 1(T 4 ,1CT 3 , 5 x 
10~ 3 ,10 -2 }. <I>(/J) can be interpreted as the marginal con¬ 
fidence level for one iteration. The function is also shown 
in Fig. 4 for visualization. We will release the code to gen¬ 
erate the table and to compute f?Normai(<), ^rc 1 )) numeri¬ 
cally. 

C. Experiment Detailed Setting and Extra 
Results 

C.l. More Results of the Synthetic Data Experiment 

The results with the marginal variance estimate o t for Rac¬ 
ing are shown in Fig. 5. The Racing algorithms (both 
EBS and Normal) performs more conservatively compared 
to the plots when using pairwise variance estimate &ij in 
Fig. 1, but the relative performance of all the algorithms are 
very similar to Fig. 1. 


Plugging in the definition of GNormai in Eq. 9 and applying 
the assumption op = Oi, we will get 


Mx* ^ Pi L _ T (t) - 1 

(o-x* + Oi) VfW V N - 1 


1/2 

-^Normal; Vz f % 
( 21 ) 


Solve the above inequality for and use the definition 
of the gap A we get 


rp(i) 


> 


N 


(N-l) 


A 2 


4 Bl 


fS/D) 


+ 1 


def 


( 22 ) 


We also provide the results with D = 2 and D = 100 
when Racing algorithms use pairwise variance estimate in 
Fig. 6 and 7 respectively. Racing-Normal performs the best 
in all situations and the empirical error never exceeds the 
provided bound 5 with a statistical significance of 0.05. 

Notice that the error of adaptive lil’UCB exceeds the er¬ 
ror tolerance in the experiment with D = 100 and A,n ~ 
Uniform[0,1]. This is because we use the recommended 
heuristic setting of parameters in Jamieson et al. (2014) 
that unfortunately does not satisfy the theoretical guaran¬ 
tee of Thm. 2 in Jamieson et al. (2014). lil’UCB (heuristic) 
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Table 1 . TlNormal^i ) 

tttC 1 ) 

5 x icr 5 Ter 3 5 x io ~ 4 Ter 3 5 x i0" a io ~ 2 


s 

1.0e-06 

3.0e-06 

5.0e-06 

7.0e-06 

9.0e-06 

1.0e-05 

3.0e-05 

5.0e-05 

7.0e-05 

9.0e-05 

1.0e-04 

3.0e-04 

5.0e-04 

7.0e-04 

9.0e-04 

1.0e-03 

3.0e-03 

5.0e-03 

7.0e-03 

9.0e-03 

1.0e-02 

3.0e-02 

5.0e-02 

7.0e-02 

9.0e-02 

l.Oe-Ol 

1.3e-01 

1.6e-01 

1.9e-01 

2.2e-01 

2.5e-01 

2.8e-01 

3.1e-01 

3.4e-01 

3.7e-01 

4.0e-01 

4.3e-01 

4.6e-01 

4.9e-01 


5.27250 

5.06504 

4.96669 

4.89969 

4.85078 

4.82952 

4.60397 

4.49660 

4.42331 

4.36853 

4.34343 

4.09380 

3.97189 

3.88945 
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performed significantly better than the setting with guar¬ 
antees in Jamieson et al. (2014). So we expect that adap¬ 
tive lil’UCB with parameters satisfying Thm. 2 of Jamieson 
et al. (2014) will perform significantly worse than adaptive 
lil’UCB (heuristic) and Racing-Normal in terms of the re¬ 
ward sample complexity. 

C.2. Details of the Bayesian ARCH Model Selection 
Experiment 

An ARCH model is commonly used to model the stochastic 

def 

volatility of financial times series. Let r t = \og(p t /pt-i) 
be the logarithm return of some asset price p t at time / . We 
assume a constant mean process for the return and remove 
the estimated mean in a pre-process step. An important 
problem in applying ARCH for financial data is to choose 
the complexity, the order q of the auto-regressive model. 
We treat the model selection problem as a Bayesian infer¬ 
ence problem for the random variable q. We use a uniform 
prior distribution, n{q) = 1/|Q|. 

An MCMC algorithm was introduced in Carlin & Chib 
(1995) to infer the posterior model distribution by aug¬ 
menting the parameter space to a complete parameter set 
for all models ((a^)^_ 0 , v^),j £ Q, then assigning the 
regular prior for the selected model j = q and pseudopriors 
for those models that are not selected j / q. Then regular 
MCMC algorithms can be applied to sample all the random 
variables q , v^)j without the problem of transdi- 


mensional moves as in reversible jump MCMC. 

The mixing rate of Carlin & Chib (1995) depends on a 
proper choice of the pseudoprior for Ideally 

it should be similar to the parameter posterior when the 
model is chosen p(atf \ u^)\q = j , r). We first reparam¬ 
eterize {o^\ iX ? )) with a softplus function x = log(l + 
exp(a/)) to allow a full support along the real axis and then 
take the Laplace approximation at the MAP of transformed 
parameters as the pseudoprior for each model separately. 

In order to avoid accessing the entire dataset each itera¬ 
tion, we use subsampling-based algorithms to sample all 
the conditionals except the pseudoprior as follows 

g|(a w ,i/ w )j ~ 7r(g) JJp(r t |a (9) ,r t _ g:t _i,i/ (9) ), 

t 

(a. (q \v (q) )\q ~ p(a (9) )p(^ (<?) ) v 

t 

(a w ,i/ w )|g ~ PpseudopriorCa^^^Vj ± q, (25) 

where we sample q with Racing-Normal Gibbs and sample 
q,(<3) ; j/'D us i n g MH with a proposal from SGLD and a re¬ 
jection step provided by Racing-Normal MH. The rejection 
step controls the error introduced in SGLD when the step 
size is large. 

As the marginal likelihood for each model could be differed 
by a few orders of magnitudes, to make sure every model is 
sampled sufficiently often, we first adjust the prior distribu- 












Scalable Discrete Sampling as a Multi-Armed Bandit Problem 


—8 



CL 

E 


10“ 


■Racing-Normal 
Racing-EBS 
■Adapted LIL'UCB 




io “ 3 io “ 2 io _1 

Error Torelance8 


(a) a = 0.1, very hard 


o 

UJ 

15 


CL 

E 

LU 


10 “’ 
1 o “ 2 
1 o“ 3 


io“ 3 1 o“ 2 10- 1 

Error TorelanceS 

(b) a = 10 -4 , easy 


o 

LU 

15 


CL 

E 

LU 


i o 1 
i o “ 2 
10“ 3 


io“ 3 io“ 2 10“' 

Error TorelanceS 

(c) a = 10 -5 , very easy 



(d) a = 0.1 
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(e) a = 10 4 , in log scale 



(f) a = 10 5 , in log scale 


Figure 5. Synthetic data. D = 10. Racing uses marginal variance estimate <7i. ((a),(b),(c)) Estimated error with 95% confidence interval. 
Plots not shown if no error occured. ((d),(e),(f)) proportion of sampled data, log f n (i ) is sampled from Normal (x). Uniform (O) and 
LogNormal (□) distributions. Plots of Racing-Normal overlap in ((f),(g),(h)). 


tion 7r with the Wang-Landau algorithm with an annealing 
adaptation on log7r, 1/(1 + f/100), so that the posterior 
distribution p(q |r) is approximately uniform. We then fix 
7 f and compare the exact and approximate MCMC algo¬ 
rithms. The real posterior distribution can be computed as 
P(q |r) (xp(q\r)/f(q). 

We choose the step size separately for the exact and 
stochastic gradient Langevin dynamics (Welling & Teh, 
2011) so that the acceptance rate is about 36%. 

We apply the control variates by first segmenting the 2-D 
space of z j<t = f (r t ,a^ + where 

takes the MAP value, equally into 100 bins according to 
marginal quantiles and then taking the reference points at 
the mean of each bin. We also notice that some data points 
have large residual reward /, ,, — /t v „ when z J t is far from 
the reference point. We take 20% of the points with the 
largest distance in z as outliers, always compute them every 
iteration and apply the subsampling algorithm for the rest 
data. 


1. \C y \ ^ \C y ’\ and the distribution of the cluster size 
follows approximately a power law with the value 
varying from as small as 1 to thousands. If we set 
rrS 1 ' > = 50 as usual, we already draw about 33% of 
all the rewards in the first mini-batch. So we slightly 
abuse the Normal assumption and use a small size for 
= 3 and use doubling scheme for the rest with 
m.y = (\C y \ — 3)/10 A 1. The experiment shows an 
empirical error 0.045 of mis-identification of the best 
arm with the provided bound d = 0.05. 


2. The distribution of {fg(xi,Xj) : j £ C y } is inde¬ 
pendent from different clusters/arms. We exploit the 
independence of rewards and choose the bound 


G 


Normal 


(<5) frj) 



- 1/2 


-^Normal- 


(26) 


C.3. Details of the Author Coreference Experiment 

The main differences of this sampling problem from Eq. 1 
are that 


This modification has the same performance as with 
the pairwise variance estimate and has the same com¬ 
putational complexity as with the marginal variance 
estimate O(DN). We compute f?Normai with a sub- 





























Scalable Discrete Sampling as a Multi-Armed Bandit Problem 
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(a) u = 0.1, very hard 
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(b) a = 10~ 4 , easy 
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(c) cr = 10 -5 , very easy 



(d) cr = 0.1 
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(f) a = 10 5 , in log scale 


Figure 6. Synthetic data. D = 2. Racing uses pairwise variance estimate cy,. ((a),(b),(c)) Estimated error with 95% confidence interval. 
Plots not shown if no error occured. ((d),(e),(f)) proportion of sampled data, log f n (i ) is sampled from Normal (x). Uniform (O) and 
LogNormal (□) distributions. Plots of Racing-Normal overlap in ((f),(g),(h)). 



(a) cr = 0.1, very hard 




(b) a = 10 4 , easy (c) cr = 10 5 , very easy 



(d) cr = 0.1 
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Figure 7. Synthetic data. D = 100. Racing uses pairwise variance estimate dyy ((a),(b),(c)) Estimated error with 95% confidence 
interval. Plots not shown if no error occured. ((d),(e),(f)) proportion of sampled data, log f n (i) is sampled from Normal (x), Uniform 
(O) an d LogNormal (□) distributions. Plots of Racing-Normal overlap in ((f),(g),(h)). 
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optimal but simpler choice as 

£ Normal (<5) = (l - ■ (27) 

It is easy to show that Eq. 5 still holds in this case 
using a union bound across t. The bound in Eq. 27 
is strictly looser than i?Normai = £ _1 (c>) but the dif¬ 
ference is small when ) < 1 and diminishes to 0 as 

<5 —)• 0 . 

We obtained the dataset from the authors of Singh et al. 
(2012) but it is different from what is used in Singh et al. 
(2012) with more difficult citations. The best B 3 F-l score 
reported in this paper is a reasonable value for this data set 
according to personal communications with the authors of 
Singh et al. (2012). 




